Buying a new house is a difficult task. House prices vary because of many factors, such as the number of bedrooms, the size of the basement, and more. We have therefore made it easier for buyers to search for their dream house by developing a program that predicts house price from various attributes of the house. We consider 24 attributes (the dataset is described during the execution) to predict one variable, SalePrice. This program will help buyers understand whether a particular house is within their budget. Brokers sometimes inflate the price of a house just to earn extra commission; with this program, buyers will be able to tell whether the broker is quoting the actual price. It can also help real estate brokers sell a property at the right price.
It is our job to predict the sale price for each house. For each Id in the test set, we must predict the value of the SalePrice variable. We will also find the attributes that most strongly affect SalePrice.
This program will help :
1. House buyers
2. Real estate brokers
Let us begin with understanding data.
Here we import all libraries.
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
Let us read the dataset.
data=pd.read_csv("C:\\Users\\Admin\\Desktop\\Main_data.csv")
Following steps will help us understand the profile of the dataset.
import pandas as pd
import pandas_profiling
pandas_profiling.ProfileReport(data)
data.head()
Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
Let us see how many values are null. False indicates a non-null value; True indicates a null value.
data.isnull()
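The raw True/False grid from isnull() is hard to scan, so in practice it is condensed into per-column null counts with .sum(). A minimal sketch on a made-up two-column frame (not the actual housing data):

```python
import numpy as np
import pandas as pd

# Toy frame: "Fence" is mostly missing, "LotArea" is complete.
toy = pd.DataFrame({
    "Fence": [np.nan, "MnPrv", np.nan, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# True counts as 1, so summing the boolean grid gives nulls per column.
null_counts = toy.isnull().sum()
print(null_counts)
# Fence      3
# LotArea    0
```

Running the same idiom on the full dataset makes columns like Fence stand out immediately.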
From the above step we observed that the columns Fence and LowQualFinSF have many null values, so we will not consider them while predicting prices. Let us drop the two columns.
data = data.drop(columns=["Fence", "LowQualFinSF"])
data.head()
In the above step we dropped columns with many null values. Let us now drop rows that contain 'NA' values; rows with 'NA' values are of no use when predicting the SalePrice of a house.
df=data.dropna(how='any',axis=0)
df
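The dropna(how='any', axis=0) call above removes every row that has at least one missing value. A small sketch on hypothetical data showing exactly which rows survive:

```python
import numpy as np
import pandas as pd

# Row 0 is complete; rows 1 and 2 each have one NaN.
toy = pd.DataFrame({"A": [1, np.nan, 3], "B": [4, 5, np.nan]})

# how='any': drop a row if ANY of its values is NaN (axis=0 means rows).
cleaned = toy.dropna(how="any", axis=0)
print(len(cleaned))  # 1
```

With how='all' instead, only rows where every value is missing would be dropped, which keeps far more data.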
We have finished cleaning the dataset. Let us see the description of our final dataset.
df.describe()
Data visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers of the images.
Let us begin with visualization by importing libraries.
%matplotlib inline
import numpy as np
from scipy import stats, integrate
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes= True)
Univariate analysis involves the analysis of only one variable. It does not deal with causes or relationships; it summarizes the data and finds patterns within it. The most convenient way to examine a univariate distribution in seaborn is distplot(). It draws a histogram and fits a kernel density estimate (KDE).
plt.figure(figsize=(9,8)) #this sets the size of figure
sns.distplot(df['SalePrice'],bins=50) # histogram of SalePrice with its kernel density estimate (in newer seaborn versions, use sns.histplot(..., kde=True))
print("Skewness: %f" % df['SalePrice'].skew())
print("Kurtosis: %f" % df['SalePrice'].kurt())
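A positive skew like the one printed above is typical for prices, and a log transform usually brings the distribution closer to normal. A minimal sketch on synthetic right-skewed values (log-normal samples standing in for SalePrice, not the real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Log-normal samples: heavily right-skewed, like house prices.
prices = pd.Series(np.exp(rng.normal(12, 0.4, size=1000)))

raw_skew = prices.skew()            # clearly positive
log_skew = np.log1p(prices).skew()  # close to zero after the transform
print(f"raw skew: {raw_skew:.3f}, log skew: {log_skew:.3f}")
```

Many house-price models are trained on log(SalePrice) for exactly this reason.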
cor_mat = df.corr()
cor_with_tar=cor_mat.sort_values(['SalePrice'],ascending=False)
print("The most relevant features (numeric) for the target are :")
cor_with_tar.SalePrice
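The corr-then-sort idiom above can be seen end to end on a tiny hypothetical frame, where one feature is built to track the target and one is pure noise:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
area = rng.uniform(500, 4000, n)
toy = pd.DataFrame({
    "LotArea": area,
    "Noise": rng.normal(size=n),                      # unrelated feature
    "SalePrice": 50 * area + rng.normal(0, 5000, n),  # driven by LotArea
})

# Correlation matrix, sorted so the strongest SalePrice correlates come first.
ranked = toy.corr().sort_values("SalePrice", ascending=False)["SalePrice"]
print(ranked)
```

The target itself always sits at the top with correlation 1.0; the engineered feature ranks just below it, and the noise column falls near zero.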
Note that some of the features have quite a high correlation with the target; these features are significant. Of these, the features with a correlation value > 0.5 are especially important, such as BedroomAbvGr. We will consider these features (e.g. BedroomAbvGr, LotArea) in more detail in subsequent sections during univariate and bivariate analysis.
# using a correlation heatmap to visualize features with high correlation.
cor_mat = df[['BedroomAbvGr', 'LotArea', 'LotFrontage', 'SalePrice']].corr()
mask = np.triu(np.ones_like(cor_mat, dtype=bool), k=1)  # hide the duplicate upper triangle
fig = plt.gcf()
fig.set_size_inches(30, 12)
sns.heatmap(data=cor_mat, mask=mask, square=True, annot=True, cbar=True)
In this section univariate analysis is performed, focusing on the features that have a high correlation with the target.
For numeric features, 'distplot' is used; 'boxplot' is used to analyze their distribution.
def plot_cat(feature):
    sns.countplot(data=df, x=feature)  # bar chart of counts per category
plot_cat('OverallQual')
Bivariate Analysis is the concept to find relationship between two variables and to find how strong is this relationship. It is a form of quantitative analysis.
# GarageArea vs SalePrice
fig, ax = plt.subplots()
ax.scatter(x=df['GarageArea'], y=df['SalePrice'])
plt.ylabel('SalePrice')
plt.xlabel('GarageArea')
plt.show()
# could try removing the points with GarageArea > 1200 as outliers.
The SalePrice increases as GarageArea increases.
# LotFrontage vs SalePrice
fig, ax = plt.subplots()
ax.scatter(x=df['LotFrontage'], y=df['SalePrice'])
plt.ylabel('SalePrice')
plt.xlabel('LotFrontage')
plt.show()
SalePrice tends to increase with LotFrontage.
sns.jointplot(x="YearBuilt",y="SalePrice",data=df)
As can be observed above, the largest number of houses were built around the year 2000; the highest SalePrice is roughly 280,000, and houses priced in the range 140,000 to 200,000 account for the most sales.
fig = plt.figure(figsize=(10, 6))
sns.boxplot(x='OverallQual', y='SalePrice', data=df[['SalePrice', 'OverallQual']])
It can be observed from the above box plots that as the overall quality of the houses increases, the mean sale price also increases.
Here we plot the relationship between three variables at once: SalePrice against an area measure, with OverallQual as the hue.
fig = plt.figure(figsize=(20, 8))
fig.add_subplot(121)
sns.scatterplot(x=df.GarageArea, y=df.SalePrice, hue=df.OverallQual, palette='Spectral')
fig.add_subplot(122)
sns.scatterplot(x=df['1stFlrSF'], y=df.SalePrice, hue=df.OverallQual, palette='YlOrRd')
plt.show()
There are 4 bathroom variables in our data set. Of these, FullBath has the largest individual correlation with SalePrice.
fig = plt.figure(figsize=(20,10))
fig1 = fig.add_subplot(221); sns.regplot(x='FullBath', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.FullBath.corr(df['SalePrice'])))
fig2 = fig.add_subplot(222); sns.regplot(x='HalfBath', y='SalePrice', data=df);
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.HalfBath.corr(df['SalePrice'])))
fig3 = fig.add_subplot(223); sns.regplot(x='BsmtFullBath', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.BsmtFullBath.corr(df['SalePrice'])))
fig4 = fig.add_subplot(224); sns.regplot(x='BsmtHalfBath', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.BsmtHalfBath.corr(df['SalePrice'])))
plt.show()
for col in df.columns:
    print(col)
results = df.dtypes # check the data type of each column
print(results)
#Converting columns to categorical columns in order to do regression:
continuous_columns = ['Id','MSSubClass','LotFrontage','LotArea','OverallQual','OverallCond','YearBuilt','1stFlrSF','2ndFlrSF','BsmtFullBath','BsmtHalfBath',
'FullBath','HalfBath','BedroomAbvGr','GarageArea','SalePrice','KitchenAbvGr']
categorical_columns = [col for col in df.columns if col not in continuous_columns]
print(categorical_columns)
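The continuous/categorical split above is just a set-difference over column names. A toy sketch with made-up columns (not the real attribute list):

```python
import pandas as pd

toy = pd.DataFrame({
    "LotArea": [8450, 9600],        # numeric, listed as continuous
    "MSZoning": ["RL", "RM"],       # everything else falls out as categorical
    "Street": ["Pave", "Grvl"],
})

continuous = ["LotArea"]
categorical = [c for c in toy.columns if c not in continuous]
print(categorical)  # ['MSZoning', 'Street']
```

Maintaining only the continuous list by hand and deriving the categorical list keeps the two from drifting apart as columns are added or dropped.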
# Convert the remaining columns to an ordered categorical dtype
# (one-hot encoding is applied later with get_dummies).
for col_name, col in df[categorical_columns].items():
    df[col_name] = col.astype('category').cat.as_ordered()
print(df.dtypes)
#Using dummy variables in order to replace categorical columns
housing_data = df.copy()
housing_train_data = pd.get_dummies(housing_data, columns=['MSZoning','LotShape','Utilities','Heating','Street'], prefix = ['MSZoning','LotShape','Utilities','Heating','Street'])
print(housing_train_data.dtypes)
print(housing_train_data)
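get_dummies expands each listed categorical column into one indicator column per level, leaving the other columns untouched. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "LotArea": [8450, 9600, 11250],
})

# One indicator column per zoning level; LotArea passes through unchanged.
dummies = pd.get_dummies(toy, columns=["MSZoning"], prefix=["MSZoning"])
print(dummies.columns.tolist())
# ['LotArea', 'MSZoning_RL', 'MSZoning_RM']
```

Each row has exactly one of the MSZoning_* indicators set, so no artificial ordering is imposed on the zoning categories.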
# Plotting the correlation matrix and observe the heatmap
corrmatrix = housing_train_data.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmatrix, vmax=.8, square=True);
# correlation matrix with the correlation values in heatmap
plt.figure(figsize=(100,80))
cor = housing_train_data.corr()
sns.set(font_scale=4)
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds,annot_kws={"size": 45},vmax=.8, square=True)
plt.show()
# take the threshold as 0.2; all the columns whose absolute correlation with
# SalePrice exceeds the threshold are selected for regression
threshold = 0.2
filtered_data = abs(cor['SalePrice'])
result = filtered_data[filtered_data > threshold]
print(result)
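The thresholding step above is a simple boolean filter over a Series of correlations. A sketch with hypothetical correlation values (not taken from the real heatmap):

```python
import pandas as pd

# Made-up correlations of a few features with SalePrice.
cor_with_target = pd.Series(
    {"GarageArea": 0.62, "LotFrontage": 0.35, "MoSold": 0.05, "YrSold": -0.03}
)

threshold = 0.2
# abs() so strongly negative correlates would also survive the cut.
selected = cor_with_target[cor_with_target.abs() > threshold]
print(selected.index.tolist())  # ['GarageArea', 'LotFrontage']
```

Taking the absolute value matters: a feature with correlation -0.6 is just as informative as one with +0.6.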
#form a final train data on which regression is to be done
final_train_data = housing_train_data[['MSSubClass','LotFrontage','LotArea','1stFlrSF','2ndFlrSF','FullBath','HalfBath',
'BedroomAbvGr','GarageArea','SalePrice','MSZoning_RL','MSZoning_RM']]
print(final_train_data)
# X (feature matrix) and Y (target) required for regression:
X = final_train_data[['MSSubClass','LotFrontage','LotArea','1stFlrSF','2ndFlrSF','FullBath','HalfBath',
'BedroomAbvGr','GarageArea','MSZoning_RL','MSZoning_RM']]
Y = final_train_data[['SalePrice']]
#Splitting the data into Train and Test
from sklearn.model_selection import train_test_split, KFold, cross_val_score
xtrain, xtest, ytrain, ytest = train_test_split(X,Y,test_size=1/3, random_state=0)
# 1/3rd of the train data will be selected as test data
print("xtrain : " + str(xtrain.shape))
print("xtest : " + str(xtest.shape))
print("ytrain : " + str(ytrain.shape))
print("ytest : " + str(ytest.shape))
print(xtrain)
print(ytrain)
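The shapes printed above follow directly from test_size=1/3. A self-contained sketch on toy arrays, showing how sklearn rounds the split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(30).reshape(15, 2)  # 15 samples, 2 features
y_toy = np.arange(15)

# test_size=1/3 of 15 samples -> 5 test rows (ceil), 10 train rows.
xtr, xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=1/3, random_state=0)
print(xtr.shape, xte.shape)  # (10, 2) (5, 2)
```

Fixing random_state makes the split reproducible, which is why the same seed is used in the main pipeline.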
xtrain = xtrain.fillna(0)  # replace any remaining NaNs with 0
xtest = xtest.fillna(0)
ytest = ytest.fillna(0)
from sklearn import linear_model
model_reg = linear_model.LinearRegression() # form a linear regression model
ytrain = ytrain.fillna(0)
model_reg.fit(xtrain,ytrain) # fit the regression model to the training data
model_reg.score(xtrain,ytrain) # R-squared score of the model on the training data
model_reg.coef_
model_reg.intercept_
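coef_ and intercept_ together define the fitted line (or hyperplane). On noiseless toy data built from a known line, LinearRegression recovers the slope and intercept exactly, which is a quick sanity check of what these attributes mean:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 7.0  # exact line: slope 3, intercept 7

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # ~[3.] ~7.0
```

With multiple features, coef_ holds one weight per column of X, in the same column order, so each weight can be read as the price change per unit of that feature, all else fixed.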
predictions = model_reg.predict(xtest) # the predictions can be inspected in an Excel file on the computer at the path below
predictionsdf = pd.DataFrame(predictions)
predictionsdf.to_excel("C:/Users/Admin/Desktop/output1.xlsx")
ytest.to_excel("C:/Users/Admin/Desktop/output2.xlsx")  # write the actual values to a separate file so the predictions are not overwritten
from sklearn import metrics
print("Evaluation of the prediction model : Mean Absolute error")
print(metrics.mean_absolute_error(ytest,predictionsdf))
print("Mean Squared error")
print(metrics.mean_squared_error(ytest,predictionsdf))
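MAE averages the absolute errors, while MSE averages the squared errors and therefore punishes large misses much harder. A hand-checkable sketch on three hypothetical prices:

```python
import numpy as np
from sklearn import metrics

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])  # errors: 10, -10, 30

mae = metrics.mean_absolute_error(y_true, y_pred)  # (10+10+30)/3
mse = metrics.mean_squared_error(y_true, y_pred)   # (100+100+900)/3
print(mae, mse)
```

Note how the single 30-unit error contributes 30/50 of the MAE numerator but 900/1100 of the MSE numerator; on house prices, MSE is dominated by the worst-predicted expensive houses.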
from sklearn.linear_model import Ridge
ridgemodel = Ridge(alpha=0)
ridgemodel.fit(xtrain, ytrain)
ridge_score = ridgemodel.score(xtrain,ytrain) # Calculating the score of the model
print(ridge_score)
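With alpha=0, the Ridge penalty vanishes and the model reduces to ordinary least squares, which is why the score above matches the plain LinearRegression score. A small sketch on synthetic data confirming the two models agree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.1, 50)

ols = LinearRegression().fit(X, y)
ridge0 = Ridge(alpha=0).fit(X, y)  # zero penalty -> same solution as OLS

print(np.allclose(ols.coef_, ridge0.coef_, atol=1e-4))
```

Increasing alpha shrinks the coefficients toward zero, trading a little training-set fit for robustness against correlated features; alpha=0 forgoes that benefit entirely.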
# Plot predicted vs. actual values
plt.figure(figsize=(40,30))
plt.scatter(ytest, predictions)
plt.xlabel('Actual_values')
plt.ylabel('Predicted_values')
#Plotting residual plots in order to evaluate performance of the model
y_test_df = pd.DataFrame(ytest)
y_test_df.columns = ['SalePrice']
predictionsdf.columns = ['SalePrice']
true_val = y_test_df['SalePrice'].values.copy()
pred_val = predictionsdf['SalePrice'].values.copy()
residual = true_val - pred_val
fig, ax = plt.subplots(figsize=(40,20))
plt.xlabel('Predicted value for SalePrice')
plt.ylabel('Residual')
_ = ax.scatter(pred_val,residual)
We can hereby conclude that the number of bedrooms above ground level (BedroomAbvGr) is highly correlated with SalePrice and affects the house value the most. Other attributes that strongly impact the sale price are FullBath, LotFrontage, and LotArea. These factors need to be taken into consideration while buying a house, as they largely determine the sale price. Multiple linear regression also fits our dataset well, with an R-squared value of 0.96, which is a good number.